Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2016)

Abstract

Many enterprises now extract actionable knowledge from huge datasets as part of their core business activities. Applications span very different domains, such as fraud detection and one-to-one marketing, and encompass business analytics and decision support in both the private and public sectors. In these scenarios, the MapReduce framework holds a central place, in particular through its open source implementation, Apache Hadoop. Such environments raise new challenges in job performance prediction, driven by the need to provide Service Level Agreement guarantees to end users and to avoid wasting computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, that estimate job performance under a number of scenarios of interest, including unreliable resources. We evaluate the accuracy of our models on the TPC-DS industry benchmark, running experiments on Amazon EC2 and at the CINECA Italian supercomputing center. The results show an average accuracy in the range 9–14%.

Notes

  1. http://www.hpc.cineca.it/hardware/pico.

  2. http://www.tpc.org/tpcds/.

Acknowledgments

This work has received funding from the European Union Horizon 2020 research and innovation program under grant agreement No. 644869 (DICE). Experimental data are available as open data at https://zenodo.org/record/58847#.V5i0wmXA45Q.

Corresponding author

Correspondence to Danilo Ardagna.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Ardagna, D., Bernardi, S., Gianniti, E., Karimian Aliabadi, S., Perez-Palacin, D., Requeno, J.I. (2016). Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets. In: Carretero, J., Garcia-Blas, J., Ko, R., Mueller, P., Nakano, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10048. Springer, Cham. https://doi.org/10.1007/978-3-319-49583-5_47

  • DOI: https://doi.org/10.1007/978-3-319-49583-5_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49582-8

  • Online ISBN: 978-3-319-49583-5

  • eBook Packages: Computer Science, Computer Science (R0)
